In this practical you will get practice in choosing between, performing, and presenting the results of one- and two- sample t-tests and their non-parametric equivalants in R.
By actively following the lecture and practical and carrying out the independent study the successful student will be able to:
Workshops are not a test. It is expected that you often don’t know how to start, make a lot of mistakes and need help. Do not be put off and don’t let what you can not do interfere with what you can do. You will benefit from collaborating with others and/or discussing your results.
The lectures and the workshops are closely integrated and it is expected that you are familar with the lecture content before the workshop. You need not understand every detail as the workshop should build and consolidate your understanding. You may wish to refer to the slides as you work through the workshop schedule.
Start RStudio from the Start menu.
In RStudio, set your working directory to the folder you created previously for your 17C Data Analysis work.
Make a new script file called workshop5.R to carry out the rest of the work.
You probably want to load the tidyverse
with library(tidyverse)
.
The data in wasp.txt concern the egg-laying behaviour of a species of parasitic wasp, laying its eggs on a beetle larva. Wasps and other Hymenopterans (Ants and Bees) are haplo-diplod: unfertilised eggs are haploid and develop into males, whereas fertilised eggs are diploid and develop into females. Researchers wanted to know if mating status affected the time (in hours) the wasp takes to lay its eggs. Each row represents an individual wasp. The first column gives the time taken and the second column indicates whether they are mated or unmated.
Save a copy of the data file wasp.txt
Read in the data and check the structure
Do a quick plot of the data:
Find the means for each group. You may need to look this up Week 3: Testing, Data types and Reading in data
Create a data frame called waspsummary
that contains the means, standard deviations, sample sizes and standard errors for the mated and unmated females. You may want to look this up in the lecture notes.
Do you think this is a one/paired test or two-sample test?
Carry out a two-sample t-test:
##
## Two Sample t-test
##
## data: time by status
## t = -3.1407, df = 58, p-value = 0.002653
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -12.716797 -2.816536
## sample estimates:
## mean in group mated mean in group unmated
## 23.36667 31.13333
What do you conclude from the test? Write your conclusion in a form suitable for a report.
We need to calculate the residuals – the difference between predicted (i.e., the group mean) and observed values. We can do this in two steps:
First by adding a column that holds the mean for the group each value belongs to by ‘merging’ the summary data into the raw data:
I suggest you look at the wasp
dataframe to see what this code has done.
Second by adding a column for each ‘residual’
# add the group means to the data
# add the residuals
wasp <- wasp %>%
mutate(residual = time - mean)
Now we are ready to examine the distribution of the residuals.
Run a normality test:
##
## Shapiro-Wilk normality test
##
## data: wasp$residual
## W = 0.9749, p-value = 0.2516
Usually, when we are doing statistical tests we ‘hope’ we will find significance. In this case, we hope it will not be significant. A non-significant result means that there is no significant difference between the distribution of the residuals and a normal distribution.
What do you conclude from the result of the normality tests?
Check the residuals are homogenously distributed (variance is the same in both groups):
There’s a bit of an outlier in one group but this looks ‘ok’.
We are going to create a figure like this:
In this figure, we have the data points themselves which are in wasp
dataframe and the means and standard errors which are in the waspsummary
dataframe. That is, we have two dataframes we want to plot. Here you will learn that dataframes and aesthetics can be specified within a geom_...
(rather than in the ggplot()
) if the geom only applies to some of the data you want to plot.
We will build the plot up in small steps but as you get more used to ggplot
you’ll probably be able to create figures in fewer steps. Each time you run ggplot()
you get a new plot, you are not actually adding to an existing plot.
First, create an empty plot:
Now add the data points:
Notice how we have given the data argument and the aesthetics inside the geom. The variables status
and time
are in the wasp
dataframe
So the data points don’t overlap, we can add some random jitter in the x direction:
ggplot() +
geom_point(data = wasp, aes(x = status, y = time),
position = position_jitter(width = 0.1, height = 0))
We’ve set the vertical jitter to 0 because, in contrast to the categorical x-axis, movement on the y-axis has meaning (time).
Let’s make the points a light grey:
ggplot() +
geom_point(data = wasp, aes(x = status, y = time),
position = position_jitter(width = 0.1, height = 0),
colour = "grey50")
Now to add the errorbars. These go from one standard error below the mean to one standard error above the mean.
Add a geom_errorbar()
for errorbars:
ggplot() +
geom_point(data = wasp, aes(x = status, y = time),
position = position_jitter(width = 0.1, height = 0),
colour = "grey50") +
geom_errorbar(data = waspsummary,
aes(x = status, ymin = mean - se, ymax = mean + se),
width = 0.3)
We have specified the waspsummary
dataframe and the variables status
, mean
and se
are in that.
There are several ways you could add the mean. You could use geom_point()
but I like to use errorbars that start and stop in the same place, the mean.
Add a geom_errorbar()
for the mean:
ggplot() +
geom_point(data = wasp, aes(x = status, y = time),
position = position_jitter(width = 0.1, height = 0),
colour = "grey50") +
geom_errorbar(data = waspsummary,
aes(x = status, ymin = mean - se, ymax = mean + se),
width = 0.3) +
geom_errorbar(data = waspsummary,
aes(x = status, ymin = mean, ymax = mean),
width = 0.3)
Alter the axis labels and limits:
ggplot() +
geom_point(data = wasp, aes(x = status, y = time),
position = position_jitter(width = 0.1, height = 0),
colour = "grey50") +
geom_errorbar(data = waspsummary,
aes(x = status, ymin = mean - se, ymax = mean + se),
width = 0.3) +
geom_errorbar(data = waspsummary,
aes(x = status, ymin = mean, ymax = mean),
width = 0.3) +
ylab("Time (hr)") +
xlab(NULL) +
ylim(0, 70)
And finally format the figure in a way that is more suitable for including in a report:
ggplot() +
geom_point(data = wasp, aes(x = status, y = time),
position = position_jitter(width = 0.1, height = 0),
colour = "gray50") +
geom_errorbar(data = waspsummary,
aes(x = status, ymin = mean - se, ymax = mean + se),
width = 0.3) +
geom_errorbar(data = waspsummary,
aes(x = status, ymin = mean, ymax = mean),
width = 0.2) +
ylab("Time (hr)") +
xlab(NULL) +
ylim(0, 70) +
scale_x_discrete(labels = c("Mated", "Unmated")) +
theme_classic()
These data come from a sample of grouse shot in Scotland. The grouse livers were dissected and the number of individuals of a parasitic nematode were counted. We want to know if males and females have different infection rates.
males: 5, 16, 8, 64, 51, 11, 9, 7
females: 0, 2, 1, 3, 6, 10, 4, 12
Create dataframe for these data. Mine ended up like this . You might find it useful to look at workshop 2
Do these data look normally distributed?
What is the null hypothesis?
What test do you suggest?
Summarise the data by finding the median of each group.
## # A tibble: 2 x 2
## sex `median(nematodes)`
## <chr> <dbl>
## 1 females 3.5
## 2 males 10
Carry out a two-sample Wilcoxon test (also known as a Mann-Whitney):
##
## Wilcoxon rank sum test
##
## data: nematodes by sex
## W = 10, p-value = 0.02067
## alternative hypothesis: true location shift is not equal to 0
What do you conclude from the test? Write your conclusion in a form suitable for a report.
A box plot is a usually good choice for illustrating a two-sample Wilcoxon test because it shows the median and interquartile range.
We can create a boxplot with:
Researchers are interested in the expression levels of a particular set of 35 E.coli genes in response to heat stress. They measure the expression of the genes at 37 and 42 degrees C (labelled low and high temperature). These samples are not independent because we might expect there to be a relationship between expression at 37 and 42 degrees within a gene.
What is the null hypothesis?
Save a copy of coliexp.txt
Read the data in
What is the appropriate parametric test?
Now carry out a paired-sample t-test.
##
## Paired t-test
##
## data: expression by temperature
## t = 3.2039, df = 34, p-value = 0.002943
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.1123712 0.5021946
## sample estimates:
## mean of the differences
## 0.3072829
State your conclusion from the test in a form suitable for including in a report. Make sure you give the direction of any signifcant effect.
Decide which test you should use to analyse the following data sets. In each case give the reasons for your choice of test and state the null hypothesis. Write your conclusions in a form suitable for including in a report. Can you make figures?
Some plant biotechnologists are trying to increase the quantity of omega 3 fatty acids in Cannabis sativa. They have developed a genetically modified line using genes from Linum usitatissimum (linseed). They grow 50 wild type and fifty modified plants to maturity, collect the seeds and determine the amount of omega 3 fatty acids. The data are in csativa.txt Do you think their modification has been successful?
In order to investigate the effects of feeding fertilised grass to sheep, one of each pair of fourteen sets of twins was fed fertilised grass whilst the other was fed unfertilised grass and the adult weight of the sheep was recorded. The data are in sheep.txt . Is there difference in the effect of fertilised and unfertilised grass on sheep weight?
These contain answers and code even though they do not appear on the webpage itself.
Rmd file The Rmd file is the file I use to compile the practical. Rmd stands for R markdown allow R code and ordinary text to be inter weaved to produce well-formatted reports including webpages.
Plain script file This is plain script (.R) version of the practical
This is an example of a well formatted analysis script for the wasp data.
Introduction to module and RStudio
Testing, Data types and reading in data
Goodness of Fit and Contingency chi-squared tests
Calculating summary statistics, probabilities and confidence intervals